feat: add chaos engineering & resilience testing framework #13
Closed
oniani1 wants to merge 5 commits into CSenshi:main
Add a Toxiproxy-based chaos testing framework to systematically test how each app handles infrastructure failures (Redis/Postgres/LocalStack outages, latency, bandwidth limits, connection recovery).

New shared library:
- libs/chaos/: ToxiproxyClient, ChaosScenario types, resilience report generator, health check helpers, and full unit test suite (11/11 pass)

Chaos test suites (17 scenarios total):
- url-shortener: 7 scenarios (Redis cache/counter, Postgres, latency, recovery)
- rate-limiter: 6 scenarios (fail-open, latency, timeout, flap, bandwidth, recovery)
- web-crawler: 4 scenarios (S3 timeout, SQS latency, DynamoDB down, total outage)

Infrastructure:
- docker-compose.chaos.yml per app with Toxiproxy routing
- jest.chaos.config.ts per app (serial execution enforced via maxWorkers: 1)
- Nx chaos targets: pnpm nx run @apps/<app>:chaos

No existing source files modified; the change is purely additive. Resilience bugs are documented in the generated test report, not auto-fixed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Chaos tests were never validated because Docker wasn't available locally. Add a matrix CI job that runs each app's chaos suite on ubuntu-latest, where Docker is available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- runScenario() now re-throws on failure so Jest actually marks tests red
- Add ensureProxy() to handle 409 conflicts on re-runs
- Move env vars before imports in url-shortener and web-crawler specs (RedisModule.forRoot reads process.env at import time, not compile time)
- Add Docker healthchecks to all docker-compose.chaos.yml files (redis-cluster, postgres, redis, localstack)
- Add prisma-generate and prisma-deploy steps to CI for url-shortener
- Use pnpm exec jest instead of npx jest in CI and package.json targets
- Improve web-crawler assertAppAlive to resolve infrastructure providers
- Use ensureProxy instead of createProxy in all chaos specs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
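The 409 handling mentioned in this commit could look roughly like the sketch below. It is not the PR's actual `ensureProxy()`: the `ProxySpec` and `FetchLike` types, the return values, and the injected `fetchImpl` parameter are assumptions made for illustration. The only details taken from Toxiproxy itself are the `POST /proxies` endpoint, its default API port 8474, and the 409 response for a proxy that already exists.

```typescript
// Minimal shape of the fetch function this sketch needs, so a fake can be
// injected in tests. Node 18+'s global fetch is expected to satisfy it.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ status: number; ok: boolean }>;

// Hypothetical proxy description; field names mirror Toxiproxy's JSON body.
interface ProxySpec {
  name: string; // e.g. "redis"
  listen: string; // address Toxiproxy listens on, e.g. "0.0.0.0:26379"
  upstream: string; // real backend address, e.g. "redis:6379"
}

// Hypothetical helper: create the proxy, treating 409 (already exists) as success
// so that re-running a suite against a long-lived Toxiproxy does not fail.
async function ensureProxy(
  spec: ProxySpec,
  fetchImpl: FetchLike,
  baseUrl: string = 'http://localhost:8474', // Toxiproxy's default API address
): Promise<'created' | 'exists'> {
  const res = await fetchImpl(`${baseUrl}/proxies`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ ...spec, enabled: true }),
  });
  if (res.status === 409) return 'exists'; // left over from a previous run
  if (!res.ok) {
    throw new Error(`Toxiproxy returned ${res.status} for proxy "${spec.name}"`);
  }
  return 'created';
}
```

One trade-off of this approach: an existing proxy is reused as-is, so a stale `listen` or `upstream` from an earlier run would not be corrected. A stricter variant could delete and recreate the proxy instead.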
- Fix ESLint no-inferrable-types errors in wait-for-service.ts and client.ts
- Fix ESLint no-empty-function errors in all chaos spec files
- Convert jest.chaos.config.ts to .js to avoid ts-node module:nodenext parse failure
- Add Toxiproxy health checks (via /toxiproxy-cli) to all docker-compose.chaos.yml
- Pre-create Toxiproxy proxies in CI before running Prisma migrations (url-shortener)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix supertest import: `import * as request` → `import request` for supertest 7.x ESM default export (rate-limiter, url-shortener)
- Fix rate-limiter adapter: remove FastifyAdapter since app uses Express
- Fix rate-limit guard: add try/catch for fail-open when Redis is down, re-throw RateLimitExceededException so 429s still work
- Fix url-shortener env var timing: use dynamic require() for AppModule inside beforeAll so REDIS_HOST is set before the @Module decorator evaluates (SWC hoists static imports before env var assignments)
- Fix url-shortener test expectation: accept 201 when Redis counter is down since app has Postgres counter fallback
- Fix web-crawler DiscoveryModule: replace full AppModule with focused ChaosTestModule to avoid @ssut/nestjs-sqs → @golevelup/nestjs-discovery incompatibility with NestJS 11 testing
- Fix init-localstack.sh CRLF line endings for Linux containers
- Fix jest.chaos.config.ts → .js references in all package.json targets
- Add .gitattributes to enforce LF for shell scripts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
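The fail-open fix for the rate-limit guard can be illustrated with a framework-free sketch. The `Limiter` interface, the `canActivateFailOpen` function, and the local `RateLimitExceededException` class are stand-ins written for this example, not the app's real guard. The point is only the control flow: a genuine limit violation is re-thrown so clients still see a 429, while any other error (such as Redis being unreachable) lets the request through.

```typescript
// Stand-in for the app's exception; the real one presumably maps to HTTP 429.
class RateLimitExceededException extends Error {
  readonly status = 429;
  constructor(message: string = 'Too Many Requests') {
    super(message);
    this.name = 'RateLimitExceededException';
    // Keeps `instanceof` working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical limiter: resolves true when the request is within the limit,
// and rejects when the backing store (e.g. Redis) cannot be reached.
interface Limiter {
  isAllowed(key: string): Promise<boolean>;
}

// Hypothetical guard body showing the fail-open control flow.
async function canActivateFailOpen(limiter: Limiter, key: string): Promise<boolean> {
  try {
    const allowed = await limiter.isAllowed(key);
    if (!allowed) throw new RateLimitExceededException();
    return true;
  } catch (err) {
    // A real limit violation must still surface as a 429.
    if (err instanceof RateLimitExceededException) throw err;
    // Anything else is treated as an infrastructure failure: fail open.
    return true;
  }
}
```

Failing open trades protection for availability: during a Redis outage no request is rate limited. Whether that is acceptable depends on what the limiter protects, so a production guard would likely also log or emit a metric in the fail-open branch.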
Closing this PR. The chaos engineering framework was an exploratory effort, but we've decided to prioritize foundational unit test coverage first (see #14). The chaos testing approach may be revisited once the core test suite is more comprehensive.
Summary
- `libs/chaos/` shared library with Toxiproxy client, chaos scenario types, resilience report generator, and health check helpers
- `docker-compose.chaos.yml` per app with Toxiproxy routing infrastructure
- `pnpm nx run @apps/<app>:chaos` Nx targets for each app
What is this?
A chaos engineering framework using Toxiproxy to systematically test how each app handles infrastructure failures. Toxiproxy sits between each app and its infrastructure, injecting faults (connection drops, latency, bandwidth limits, timeouts) while the test suite verifies graceful degradation.
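As a rough illustration of "inject a fault, verify, then clean up", the sketch below adds a latency toxic through Toxiproxy's HTTP API, runs a test body, and always removes the toxic afterwards. The `withLatency` helper, its parameter names, and the `FetchLike` type are assumptions for this example and are not taken from `libs/chaos/`. The endpoints (`POST /proxies/{proxy}/toxics`, `DELETE /proxies/{proxy}/toxics/{toxic}`) and the `latency`/`jitter` attributes are Toxiproxy's own.

```typescript
// Minimal fetch shape so a fake can be injected; Node 18+'s global fetch
// is expected to satisfy it.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ status: number; ok: boolean }>;

// Hypothetical description of a latency fault on one proxy.
interface LatencyFault {
  proxy: string; // name of an existing Toxiproxy proxy, e.g. "redis"
  latencyMs: number; // delay added to each response
  jitterMs?: number; // random +/- variation on the delay
}

// Hypothetical helper: inject latency, run the test body, always clean up.
// Errors from the body propagate, so a failing assertion still fails the test.
async function withLatency<T>(
  fault: LatencyFault,
  fetchImpl: FetchLike,
  body: () => Promise<T>,
  baseUrl: string = 'http://localhost:8474', // Toxiproxy's default API address
): Promise<T> {
  const toxicName = `${fault.proxy}_latency`;
  const res = await fetchImpl(`${baseUrl}/proxies/${fault.proxy}/toxics`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      name: toxicName,
      type: 'latency',
      stream: 'downstream', // delay data flowing from the backend to the app
      toxicity: 1.0, // apply to every connection
      attributes: { latency: fault.latencyMs, jitter: fault.jitterMs ?? 0 },
    }),
  });
  if (!res.ok) throw new Error(`Could not add toxic: HTTP ${res.status}`);
  try {
    return await body();
  } finally {
    // Remove the toxic even when the body throws, so later scenarios start clean.
    await fetchImpl(`${baseUrl}/proxies/${fault.proxy}/toxics/${toxicName}`, {
      method: 'DELETE',
    });
  }
}
```

A scenario might wrap its request and assertions in `withLatency({ proxy: 'redis', latencyMs: 2000 }, fetch, async () => { ... })`; the proxy name and delay here are illustrative values only.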
No educational system design project does this. It teaches the #1 senior-level interview topic: fault tolerance and graceful degradation.
Scenarios
URL Shortener (7): Redis cache down, Postgres down, Redis counter down, Redis latency, Postgres latency, both down, Redis recovery
Rate Limiter (6): Redis down (fail-open check), Redis latency, Redis timeout, connection flap, bandwidth limit, recovery
Web Crawler (4): S3 timeout, SQS latency, DynamoDB down, total LocalStack outage
Key design decisions
- `fetch`
- `maxWorkers: 1` in Jest configs prevents Toxiproxy state corruption
- `expose` (not `ports`) for backend services, only Toxiproxy ports are host-mapped
- `ANNOUNCE_IP`, preventing cluster client bypass
Test plan
- `pnpm nx run @libs/chaos:test`: 11/11 unit tests pass
- `pnpm nx run @libs/chaos:typecheck`: passes
- `pnpm nx run @apps/rate-limiter:test`: passes (chaos specs correctly excluded)
- (`pnpm nx format:check`)
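For the serial-execution decision listed under "Key design decisions", a chaos-specific Jest configuration might contain settings along these lines. Only `maxWorkers: 1` comes from the PR description; the display name, test environment, file pattern, and timeout are assumed values for illustration.

```typescript
// Hypothetical chaos Jest settings. In a real config file this object would be
// the default export (or `module.exports` in a .js config, which is the form
// the PR eventually switched to).
const chaosJestConfig = {
  displayName: 'chaos', // assumed label
  testEnvironment: 'node',
  // Assumed naming convention that keeps chaos specs out of the normal test run.
  testMatch: ['<rootDir>/**/*.chaos.spec.ts'],
  // Run spec files one at a time: they all share one Toxiproxy instance, so
  // parallel workers could add or remove each other's toxics mid-scenario.
  maxWorkers: 1,
  // Assumed generous timeout, since scenarios wait on injected latency and recovery.
  testTimeout: 60000,
};
```

Note that `maxWorkers: 1` serializes test files; tests within a single file already run sequentially by default in Jest.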